Abstract

Recovering three-dimensional information from two-dimensional images is the fundamental goal of stereo techniques. The problem of recovering depth (three-dimensional information) from a set of images is essentially the correspondence problem: given a point in one image, find the corresponding point in each of the other images. Finding potential correspondences usually involves matching some image property. If the images are from nearby positions, they will vary only slightly, simplifying the matching process.

Once a correspondence is known, solving for the depth is simply a matter of geometry. Real images are composed of noisy, discrete samples; therefore the calculated depth will contain error. This error is a function of the baseline, or distance between the images. Longer baselines result in more precise depths. This leads to a conflict: short baselines simplify the matching process but produce imprecise results; long baselines produce precise results but complicate the matching process.

In this paper, we present a method for generating dense depth maps from large sets (thousands) of images taken from arbitrary positions. Long baseline images improve the accuracy. Short baseline images and the large number of images greatly simplify the correspondence problem, removing nearly all ambiguity. The algorithm presented is completely local and, for each pixel, generates an evidence versus depth and surface normal distribution. In many cases, the distribution contains a clear and distinct global maximum. The location of this peak determines the depth, and its shape can be used to estimate the error. The distribution can also be used to perform a maximum likelihood fit of models directly to the images. We anticipate that the ability to perform maximum likelihood estimation from purely local calculations will prove extremely useful in constructing three-dimensional models from large sets of images.
1 Introduction

Recovering three-dimensional information from two-dimensional images is the fundamental goal of stereo techniques. The problem of recovering the missing dimension, depth, from a set of images is essentially the correspondence problem: given a point in one image, find the corresponding point in each of the other images. Finding potential correspondences usually involves matching some image property in two or more images. If the images are from nearby positions, they will vary only slightly, simplifying the matching process.


Figure 1: Stereo calculation.

Once a correspondence is known, solving for depth is simply a matter of geometry. Real images are noisy, and measurements taken from them are also noisy. Figure 1 shows how the depth of point $P$ can be calculated given two images taken from known cameras $C_1$ and $C_2$ and corresponding points $p_1$ and $p_2$ within those images, which are projections of $P$. The location of $p_1$ in the image is uncertain; as a result, $P$ can lie anywhere within the left cone. A similar situation exists for $p_2$. If $p_1$ and $p_2$ are corresponding points, then $P$ could lie anywhere in the shaded region. Clearly, for a given depth, increasing the baseline between $C_1$ and $C_2$ will reduce the uncertainty in depth. This leads to a conflict: short baselines simplify the matching process, but produce uncertain results; long baselines produce precise results, but complicate the matching process.
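The trade-off can be made concrete with a small numerical sketch. It assumes the textbook special case of two rectified, parallel cameras, where depth is Z = f B / d for focal length f (in pixels), baseline B and disparity d. This is a simplification of the general geometry in Figure 1, and the function names and the half-pixel matching error are illustrative assumptions rather than part of our method.

```python
def depth_from_disparity(focal_px, baseline, disparity_px):
    """Depth of a point seen by two rectified, parallel cameras (Z = f * B / d)."""
    return focal_px * baseline / disparity_px


def depth_interval(focal_px, baseline, depth, pixel_error=0.5):
    """Range of depths consistent with a disparity known only to +/- pixel_error."""
    disparity = focal_px * baseline / depth
    near = depth_from_disparity(focal_px, baseline, disparity + pixel_error)
    far = depth_from_disparity(focal_px, baseline, disparity - pixel_error)
    return near, far


if __name__ == "__main__":
    # Same point (20 units away), same pixel error, three different baselines.
    for baseline in (0.1, 1.0, 10.0):
        near, far = depth_interval(focal_px=1000.0, baseline=baseline, depth=20.0)
        print(f"baseline {baseline:5.1f}: depth in [{near:.2f}, {far:.2f}]")
```

With these illustrative numbers the interval of consistent depths shrinks from several units at the shortest baseline to a few hundredths of a unit at the longest, which is the behaviour sketched by the shaded region in Figure 1.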

One popular set of approaches for dealing with this problem is relaxation techniques^1 [6, 9]. These methods are generally used on a pair of images: they start with an educated guess for the correspondences, then update them by propagating constraints. These techniques don't always converge and don't always recover the correct correspondences. Another approach is to use multiple images. Several researchers, such as Yachida [11], have proposed trinocular stereo algorithms. Others have also used special camera configurations to aid in the correspondence problem [10, 1, 8]. Bolles, Baker and Marimont [1] proposed constructing an epipolar-plane image from a large number of images. In some cases, analyzing the epipolar-plane image is much simpler than analyzing the original set of images. The epipolar-plane image, however, is only defined for a limited set of camera positions. Tsai [10] and Okutomi and Kanade [8] defined a cost function which was applied directly to a set of images. The extremum of this cost function was then taken as the correct correspondence. Occlusion is assumed to be negligible. In fact, Okutomi and Kanade state that they "invariably obtained better results by using relatively short baselines." This is likely the result of using a spatial matching metric (a correlation window) and ignoring perspective distortion. Both methods used small sets of images, typically about ten. They also limited camera positions to special configurations. Tsai used a localized planar configuration with parallel optic axes, and Okutomi and Kanade used short linear configurations. Cox et al. [2] proposed a maximum-likelihood framework for stereo pairs, which they have extended to multiple images. This work attempts to explicitly model occlusions, although in a somewhat ad hoc manner. It uses a few global constraints and small sets of images.

The work presented here also uses multiple images and draws its major inspiration from Bolles, Baker and Marimont [1]. We define a construct called an epipolar image and use it to analyze evidence about depth. Like Tsai [10] and Okutomi and Kanade [8], we define a cost function that is applied across multiple images, and like Cox [2] we model the occlusion process. There are several important differences, however. The epipolar image we define is valid for arbitrary camera positions and models some forms of occlusion. Our method is intended to recover dense depth maps of built geometry (architectural facades) using thousands of images acquired from within the scene. In most cases, depth can be recovered using purely local information, avoiding the computational costs of global constraints. Where depth cannot be recovered using purely local information, the depth evidence from the epipolar image provides a principled distribution for use in a maximum-likelihood approach [3].

^1 For a more complete and detailed analysis of this and other techniques see [5, 7, 4].


2 Our Approach 

In this section, we review epipolar geometry and epipolar-plane images, then define a new construct called an epipolar image. We also discuss the construction and analysis of epipolar images. Stereo techniques typically assume that relative camera positions and internal camera calibrations are known. This is sufficient to recover the structure of a scene, but without additional information the location of the scene cannot be determined. We assume that camera positions are known in a global coordinate system such as might be obtained from GPS (Global Positioning System). Although relative positions are sufficient for the discussion in this section, global positions allow us to perform reconstruction incrementally using disjoint scenes. We also assume known internal camera calibrations. The notation we use is defined in Table 1.



Symbol                     Meaning
$P_j$                      3D world point.
$C_i$                      Center of projection for the $i$th camera.
$\Pi_i$                    Image plane.
$p_j^i$                    Image point. Projection of $P_j$ onto $\Pi_i$.
$\Pi_e^k$                  Epipolar plane.
$\ell_i^k$                 Epipolar line. Projection of $\Pi_e^k$ onto $\Pi_i$.
$\mathcal{EP}_k$           Epipolar-plane image. Constructed using $\Pi_e^k$.
$p^*$                      Base image point. Any point in any image.
$C^*$                      Base camera center. Camera center associated with $p^*$.
$\Pi^*$                    Base image. Contains $p^*$.
$\ell^*$                   Base line. 3D line passing through $p^*$ and $C^*$.
$\mathcal{E}_k$            Epipolar image. Constructed using $p^*$. $k$ indexes all possible $p^*$'s.
$\mathcal{F}(x)$           Function of the image at point $x$ (e.g. image intensities, correlation window, features).
$\mathcal{X}(x_1, x_2)$    Matching function. Measures match between $x_1$ and $x_2$ (large value better match).
$\nu(j)$                   Match quality. Analyzed.
$\{E \mid C\}$             Set of all $E$'s such that $C$ is true.
$\widehat{P_1 P_2}$        Unit vector in the direction from $P_1$ to $P_2$.
$d(p_j^i)$                 Depth of image point $p_j^i$. If low confidence or unknown, then $\infty$.
$M_i$                      Modeled object. Object whose position and geometry have already been reconstructed.

Table 1: Notation used in this paper.


Figure 2: Epipolar geometry.
2.1 Epipolar Geometry

Epipolar geometry provides a powerful stereo constraint. Given two cameras with known centers $C_1$ and $C_2$ and a point $P$ in the world, the epipolar plane $\Pi_e$ is defined as shown in Figure 2. $P$ projects to $p_1$ and $p_2$ on image planes $\Pi_1$ and $\Pi_2$ respectively. The projection of $\Pi_e$ onto $\Pi_1$ and $\Pi_2$ produces epipolar lines $\ell_1$ and $\ell_2$. This is the essence of the epipolar constraint. Given any point $p$ on epipolar line $\ell_1$ in image $\Pi_1$, if the corresponding point is visible in image $\Pi_2$, then it must lie on the corresponding epipolar line $\ell_2$.
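The constraint is easy to check numerically. The sketch below is only an illustration under simplifying assumptions that are not part of the method described here: both cameras are ideal pinholes with the same focal length and no rotation (optic axes along +z), and all helper names are hypothetical. It places several candidate world points on the ray through one pixel of the first image and confirms that their projections into the second image fall on a single line.

```python
def project(center, focal, point):
    """Pinhole projection for a camera at `center` looking down +z (no rotation)."""
    x, y, z = (point[k] - center[k] for k in range(3))
    return (focal * x / z, focal * y / z)


def points_on_ray(center, focal, pixel, depths):
    """3D points on the ray from `center` through image point `pixel`, at given z-depths."""
    u, v = pixel
    return [(center[0] + d * u / focal, center[1] + d * v / focal, center[2] + d)
            for d in depths]


def collinear(a, b, c, tol=1e-9):
    """True if 2D points a, b, c lie on one line (twice the triangle area is ~0)."""
    area2 = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return abs(area2) < tol


C1, C2, FOCAL = (0.0, 0.0, 0.0), (2.0, 0.5, -1.0), 1.0
p1 = (0.1, -0.05)                                   # a point in the first image
candidates = points_on_ray(C1, FOCAL, p1, depths=[5.0, 10.0, 20.0])
q = [project(C2, FOCAL, P) for P in candidates]     # their images in the second camera
print(collinear(q[0], q[1], q[2]))                  # all three lie on one epipolar line
```

Every candidate projects to the same pixel in the first image, so the line through the three projections in the second image is the epipolar line on which the true correspondence must lie.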



Figure 3: Epipolar-plane image geometry.


2.2 Epipolar-Plane Images

Bolles, Baker and Marimont [1] used the epipolar constraint to construct a special image which they called an epipolar-plane image. As noted earlier, an epipolar line $\ell_i^k$ contains all of the information about the epipolar plane $\Pi_e^k$ that is present in the $i$th image $\Pi_i$. An epipolar-plane image is built using all of the epipolar lines $\{\ell_i^k\}$ from a set of images $\{\Pi_i\}$ which correspond to a particular epipolar plane $\Pi_e^k$ (Figure 3). Since all of the lines in an epipolar-plane image $\mathcal{EP}_k$ are projections of the same epipolar plane $\Pi_e^k$, for any given point $p$ in $\mathcal{EP}_k$, if the corresponding point in any other image $\Pi_i$ is visible, then it will also be included in $\mathcal{EP}_k$. Bolles, Baker and Marimont exploited this property to solve the correspondence problem for several special cases of camera motion. For example, with images taken at equally spaced points along a linear path perpendicular to the optic axes, corresponding points form lines in the epipolar-plane image; therefore finding correspondences reduces to finding lines in the epipolar-plane image.

For a given epipolar plane $\Pi_e^k$, only those images whose camera centers lie on $\Pi_e^k$ ($\{C_i \mid C_i \cdot \Pi_e^k = 0\}$) can be included in epipolar-plane image $\mathcal{EP}_k$. For example, using a set of images whose camera centers are coplanar, an epipolar-plane image can only be constructed for the epipolar plane containing the camera centers. In other words, only a single epipolar line from each image can be analyzed using an epipolar-plane image. In order to analyze all of the points in a set of images using epipolar-plane images, all of the camera centers must be collinear. This can be a serious limitation.
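Whether a given set of camera centers satisfies this collinearity requirement is simple to test. The following Python sketch (helper names are hypothetical, and the two example paths are made up for illustration) checks that every center lies on the line through the first two distinct centers.

```python
def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def centers_collinear(centers, tol=1e-9):
    """True if all 3D camera centers lie on a single line (assumes a non-empty list)."""
    origin = centers[0]
    offsets = [sub(c, origin) for c in centers[1:]]
    # Direction of the candidate line: first offset that is not (numerically) zero.
    direction = next((o for o in offsets if any(abs(x) > tol for x in o)), None)
    if direction is None:
        return True  # all centers coincide
    # A point is on the line exactly when its offset is parallel to the direction.
    return all(all(abs(x) <= tol for x in cross(direction, o)) for o in offsets)


linear_path = [(float(t), 0.0, 1.5) for t in range(5)]
walk_around_block = [(0.0, 0.0, 1.5), (10.0, 0.0, 1.5), (10.0, 10.0, 1.5), (0.0, 10.0, 1.5)]
print(centers_collinear(linear_path), centers_collinear(walk_around_block))
```

A straight camera path passes the test, while a closed path around a block of buildings does not, so epipolar-plane images could analyze at most one epipolar line per image for the latter.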



Figure 4: Epipolar image geometry.


2.3 Epipolar Images

For our analysis we will define an epipolar image $\mathcal{E}$ which is a function of one image and a point in that image. An epipolar image is similar to an epipolar-plane image, but has one critical difference that ensures it can be constructed for every pixel in an arbitrary set of images. Rather than use projections of a single epipolar plane, we construct the epipolar image from the pencil of epipolar planes defined by the line $\ell^*$ through one of the camera centers $C^*$ and one of the pixels $p^*$ in that image $\Pi^*$ (Figure 4). $\Pi_e^i$ is the epipolar plane formed by $\ell^*$ and the $i$th camera center $C_i$. Epipolar line $\ell_i$ contains all of the information about $\ell^*$ present in $\Pi_i$. An epipolar-plane image is composed of projections of a plane; an epipolar image is composed of projections of a line. The cost of guaranteeing an epipolar image can be constructed for every pixel is that correspondence information is accumulated for only one point $p^*$, instead of an entire epipolar line.
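The pencil of epipolar planes can be written down directly. In the sketch below (a minimal illustration with hypothetical helper names and made-up coordinates), the plane associated with camera $i$ contains the line through the base camera center along the viewing direction of the chosen pixel, so its normal is the cross product of that direction with the offset to the $i$th camera center. Every point on the line then has zero distance to every plane of the pencil.

```python
from math import sqrt


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def epipolar_plane_normal(base_center, base_dir, cam_center):
    """Unit normal of the plane containing the base line and one camera center."""
    normal = cross(base_dir, sub(cam_center, base_center))
    length = sqrt(dot(normal, normal))
    if length == 0.0:
        raise ValueError("camera center lies on the base line; plane is undefined")
    return tuple(x / length for x in normal)


def signed_distance(point, plane_point, normal):
    """Signed distance from `point` to the plane through `plane_point`."""
    return dot(sub(point, plane_point), normal)


base_center, base_dir = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
cameras = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 5.0)]
normals = [epipolar_plane_normal(base_center, base_dir, c) for c in cameras]
on_line = (0.0, 0.0, 7.5)  # an arbitrary point on the base line
print([abs(signed_distance(on_line, base_center, n)) < 1e-12 for n in normals])
```

Each camera contributes a different plane, but all of them share the same line, which is why an epipolar image gathers evidence about a single pixel rather than a whole epipolar line.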



Figure 5: Set of points which form a possible correspondence.

To simplify the analysis of an epipolar image we can group points from the epipolar lines according to possible correspondences (Figure 5). $P_1$ projects to $p_1^i$ in $\Pi_i$; therefore $\{p_1^i\}$ has all of the information contained in $\{\Pi_i\}$ about $P_1$. There is also a distinct set of points $\{p_2^i\}$ for $P_2$; therefore $\{p_j^i\}$ for a given $j$ contains all of the possible correspondences for $P_j$. If $P_j$ is a point on the surface of a physical object and it is visible in $\{\Pi_i\}$ and $\Pi^*$, then measurements taken at $p_j^i$ should match those taken at $p^*$ (Figure 6a). Conversely, if $P_j$ is not a point on the surface of a physical object, then the measurements taken at $p_j^i$ are unlikely to match those taken at $p^*$ (Figures 6b and 6c). Epipolar images can be viewed as tools for accumulating evidence about possible correspondences of $p^*$. A simple function of $j$ is used to build $\{P_j \mid \forall i < j : \|P_i - C^*\|_2 < \|P_j - C^*\|_2\}$. In essence, $\{P_j\}$ is a set of samples along $\ell^*$ at increasing depths from the image plane.

Figure 6: Occlusion effects.
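The grouping of points into sets of possible correspondences can be sketched in a few lines. The code below is an illustration only: it assumes ideal pinhole cameras with no rotation, a square image of half-width `half_size` in normalized units, and hypothetical function names. It samples world points along the line through the base camera center at increasing depth and projects each sample into every other image, recording `None` where the sample falls outside the image.

```python
def sample_base_line(base_center, base_dir, depths):
    """World points on the base line, ordered by increasing distance from the base center."""
    return [tuple(base_center[k] + d * base_dir[k] for k in range(3))
            for d in sorted(depths)]


def project(center, focal, point):
    """Pinhole projection (no rotation); None if the point is behind the camera."""
    x, y, z = (point[k] - center[k] for k in range(3))
    if z <= 0.0:
        return None
    return (focal * x / z, focal * y / z)


def correspondence_sets(samples, centers, focal, half_size):
    """sets[j][i] is the projection of sample j in image i, or None if it misses the image."""
    sets = []
    for P in samples:
        column = []
        for C in centers:
            p = project(C, focal, P)
            inside = p is not None and abs(p[0]) <= half_size and abs(p[1]) <= half_size
            column.append(p if inside else None)
        sets.append(column)
    return sets


samples = sample_base_line((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), depths=[8.0, 2.0, 4.0])
sets = correspondence_sets(samples, centers=[(1.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
                           focal=1.0, half_size=0.6)
for depth_index, column in enumerate(sets):
    print(depth_index, column)
```

In this toy configuration the nearby camera sees every sample, while the distant camera only sees the deepest one; the `None` entries are exactly the missing terms that the normalization in Section 2.4 has to account for.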


2.4 Analyzing Epipolar Images

An epipolar image $\mathcal{E}$ is constructed by organizing $\{\mathcal{F}(p_j^i) \mid \mathcal{F}()$ is a function of the image$\}$ into a two-dimensional array with $i$ and $j$ as the vertical and horizontal axes respectively. Rows in $\mathcal{E}$ are epipolar lines from different images; columns form sets of possible correspondences ordered by depth^2 (Figure 7). The quality $\nu(j)$ of the match between column $j$ and $p^*$ can be thought of as evidence that $p^*$ is the projection of $P_j$ and $j$ is its depth. Specifically:

$$\nu(j) = \sum_i \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*)) \tag{1}$$


where $\mathcal{F}()$ is a function of the image and $\mathcal{X}()$ is a cost function which measures the difference between $\mathcal{F}(p_j^i)$ and $\mathcal{F}(p^*)$. A simple case is

$$\mathcal{F}(x) = \text{intensity values at } x$$

and

$$\mathcal{X}(x_1, x_2) = -|x_1 - x_2|.$$

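Real cameras are finite, and $p_j^i$ may not be contained in the image $\Pi_i$ ($p_j^i \notin \Pi_i$). Only terms for which $p_j^i \in \Pi_i$ should be included in (1). To correct for this, $\nu(j)$ is normalized, giving:

$$\nu(j) = \frac{\sum_{i \mid p_j^i \in \Pi_i} \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*))}{\sum_{i \mid p_j^i \in \Pi_i} 1} \tag{2}$$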

Ideally, $\nu(j)$ will have a sharp, distinct peak at the correct depth, so that

$$\arg\max_j \nu(j) = \text{the correct depth of } p^*.$$
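The normalized match quality and its maximum can be illustrated on a toy epipolar image. This is a minimal sketch, not a full implementation: it assumes scalar intensities, the simple match function given above, a list-of-rows layout in which `None` marks a sample that falls outside its image, and hypothetical function names and data.

```python
def match(x1, x2):
    """Simple match function: larger (closer to zero) means a better match."""
    return -abs(x1 - x2)


def match_quality(epipolar_image, base_value):
    """Normalized match quality for every column j of a toy epipolar image.

    epipolar_image[i][j] holds the intensity sampled in image i for depth index j,
    or None when that sample falls outside image i.
    """
    scores = []
    for j in range(len(epipolar_image[0])):
        terms = [match(row[j], base_value) for row in epipolar_image if row[j] is not None]
        # Average only over the images that actually contain the sample.
        scores.append(sum(terms) / len(terms) if terms else float("-inf"))
    return scores


def best_depth_index(scores):
    """Index j of the global maximum of the match quality."""
    return max(range(len(scores)), key=scores.__getitem__)


toy_image = [
    [0.10, 0.50, 0.80, 0.30],
    [0.20, None, 0.79, 0.90],
    [None, 0.40, 0.82, 0.10],
]
scores = match_quality(toy_image, base_value=0.80)
print(best_depth_index(scores))  # column 2 agrees with the base pixel in every image
```

Columns with no valid samples receive a score of negative infinity so that they can never be selected; that choice is a convention of this sketch.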


As the number of elements in $\{p_j^i\}$ for a given $j$ increases, the likelihood increases that $\nu(j)$ will be large when $P_j$ lies on a physical surface and small when it does not. Occlusions do not produce peaks at incorrect depths, or false positives^3. They can, however, cause false negatives, or the absence of a peak at the correct depth (Figure 8). A false negative is essentially a lack of evidence about the correct depth. Occlusions can reduce the height of a peak, but a dearth of concurring images is required to eliminate the peak. Globally this produces holes in the data. While less than ideal, this is not a major issue and can be addressed in two ways: removing the contribution of occluded views, and adding unoccluded views by acquiring more images.

^2 The depth of $P_j$ can be trivially calculated from $j$; therefore we consider $j$ and depth to be interchangeable.
^3 Except possibly in adversarial settings.

Figure 7: Constructing an epipolar image $\mathcal{E}(\Pi^*, p^*)$. The vertical axis is $i$ (image).

Figure 8: False negative caused by occlusion.



A large class of occluded views can be eliminated quite simply. Figure 9 shows a point $P_j$ and its normal $n_j$. Images with camera centers in the hashed half space cannot possibly view $P_j$. $n_j$ is not known a priori, but the fact that $P_j$ is visible in $\Pi^*$ limits its possible values. This range of values can then be sampled and used to eliminate occluded views from $\nu(j)$. Let $\alpha$ be an estimate of $n_j$ and $\widehat{C_i P_j}$ be the unit vector along the ray from $C_i$ to $P_j$; then $P_j$ can only be visible if $\widehat{C_i P_j} \cdot \alpha < 0$.

Figure 9: Exclusion region for $P_j$.

If the vicinity of $\{\Pi_i\}$ is modeled (perhaps incompletely) by previous reconstructions, then this information can be used to improve the current reconstruction. Views for which the depth^4 $d(p_j^i)$ at $p_j^i$ is less than the distance from $\Pi_i$ to $P_j$ can also be eliminated. For example, if $M_1$ and $M_2$ have already been reconstructed, then $i \in \{1, 2, 3\}$ can be eliminated from $\nu(j)$ (Figure 8). The updated function becomes:

$$\nu(j, \alpha) = \frac{\sum_{i \in S} \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*))}{\sum_{i \in S} 1} \tag{3}$$

where

$$S = \left\{ i \;\middle|\; p_j^i \in \Pi_i,\;\; \widehat{C_i P_j} \cdot \alpha < 0,\;\; d(p_j^i) \geq \|C_i - P_j\|_2 \right\}.$$

Then, if sufficient evidence exists,

$$\arg\max_{j, \alpha} \nu(j, \alpha) \Rightarrow \begin{cases} j = \text{depth of } p^* \\ \alpha = \text{an estimate of } n_j \end{cases}$$

One way to eliminate occlusions such as those shown in Figure 8 is to process the set of epipolar images $\{\mathcal{E}_k\}$ in a best-first fashion. This is essentially building a partial model and then using that model to help analyze the difficult spots. $\nu(j, \alpha)$ is calculated using purely local operations. Another approach is to incorporate global constraints.

^4 Distance from $\Pi_i$ to the closest previously reconstructed object or point along the ray starting at $C_i$ in the direction of $p_j^i$.
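The selection of views that contribute to the updated match quality can be sketched as a filter with three tests: the sample must project inside the image, the view must face the estimated surface normal, and no previously modeled object may lie in front of the sample. The code below is a minimal illustration; the function name, the list-based inputs, and the use of infinity for an unknown modeled depth are assumptions of the sketch.

```python
from math import sqrt


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retained_views(P_j, alpha, centers, in_image, modeled_depth):
    """Indices of the views that pass all three occlusion tests for world point P_j.

    alpha            -- estimate of the surface normal at P_j
    centers[i]       -- camera center of view i (assumed distinct from P_j)
    in_image[i]      -- True if P_j projects inside image i
    modeled_depth[i] -- distance to the nearest already-modeled surface along the
                        ray toward P_j, or float('inf') if nothing is modeled
    """
    kept = []
    for i, C in enumerate(centers):
        ray = sub(P_j, C)
        dist = sqrt(dot(ray, ray))
        unit_ray = tuple(x / dist for x in ray)
        faces_camera = dot(unit_ray, alpha) < 0.0   # half-space test
        unblocked = modeled_depth[i] >= dist        # no modeled object in front
        if in_image[i] and faces_camera and unblocked:
            kept.append(i)
    return kept


INF = float("inf")
kept = retained_views(
    P_j=(0.0, 0.0, 10.0),
    alpha=(0.0, 0.0, -1.0),
    centers=[(0.0, 0.0, 0.0), (0.0, 0.0, 20.0), (5.0, 0.0, 0.0), (-5.0, 0.0, 0.0)],
    in_image=[True, True, True, False],
    modeled_depth=[INF, INF, 4.0, INF],
)
print(kept)
```

In this example only the first view survives: the second looks at the back of the surface, the third is blocked by a modeled object four units away, and the fourth does not contain the sample.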


3 Results 


Synthetic imagery was used to explore the characteristics of $\nu(j)$ and $\nu(j, \alpha)$. A CAD model of Technology Square, the four-building complex housing our laboratory, was built by hand. The locations and geometries of the buildings were determined using traditional survey techniques. Photographs of the buildings were used to extract texture maps which were matched with the survey data. This three-dimensional model was then rendered from 100 positions along a "walk around the block" (Figure 10). From this set of images, a $\Pi^*$ and $p^*$ were chosen and an epipolar image $\mathcal{E}$ constructed. $\mathcal{E}$ was then analyzed using two match functions:


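$$\nu(j) = \frac{\sum_{i \mid p_j^i \in \Pi_i} \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*))}{\sum_{i \mid p_j^i \in \Pi_i} 1} \tag{4}$$

and

$$\nu(j, \alpha) = \frac{\sum_{i \in S} (\widehat{C_i P_j} \cdot \alpha)\, \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*))}{\sum_{i \in S} \widehat{C_i P_j} \cdot \alpha} \tag{5}$$

where^5

$$\mathcal{F}(x) = \mathrm{hsv}(x) = [h(x), s(x), v(x)]^T \tag{6}$$

^5 hsv is the well-known hue, saturation and value color model.

Figure 10: Examples of the rendered model.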


$$\mathcal{X}([h_1, s_1, v_1]^T, [h_2, s_2, v_2]^T) = -\frac{s_1 + s_2}{2}\,(1 - \cos(h_1 - h_2)) - \frac{2 - s_1 - s_2}{2}\,|v_1 - v_2| \tag{7}$$
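A saturation-weighted HSV match function of this kind can be sketched as follows. The exact weights used here, $(s_1 + s_2)/2$ on the hue term and $(2 - s_1 - s_2)/2$ on the value term, are an assumption of this sketch, as are the function name and the use of RGB triples as input. Python's standard `colorsys` module returns hue in $[0, 1)$, so it is converted to an angle before taking the cosine.

```python
import colorsys
from math import cos, pi


def hsv_match(rgb1, rgb2):
    """Match between two RGB colors in [0, 1]^3; 0 is a perfect match, lower is worse.

    Assumed form: hue differences count when the colors are saturated,
    value (brightness) differences count when they are not.
    """
    h1, s1, v1 = colorsys.rgb_to_hsv(*rgb1)
    h2, s2, v2 = colorsys.rgb_to_hsv(*rgb2)
    hue_angle = 2.0 * pi * (h1 - h2)                       # colorsys hue is in [0, 1)
    hue_term = 0.5 * (s1 + s2) * (1.0 - cos(hue_angle))
    value_term = 0.5 * (2.0 - s1 - s2) * abs(v1 - v2)
    return -(hue_term + value_term)


red, cyan = (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)
dark_gray, light_gray = (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)
print(hsv_match(red, red), hsv_match(red, cyan), hsv_match(dark_gray, light_gray))
```

Weighting by saturation reflects the fact that hue is poorly defined for gray pixels, so for unsaturated colors the comparison falls back on brightness alone.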

Figures 11 and 12 show a base image $\Pi^*$ with $p^*$ marked by a cross. Under $\Pi^*$ is the epipolar image $\mathcal{E}$ generated using the remaining 99 images. Below $\mathcal{E}$ are the matching functions $\nu(j)$ (4) and $\nu(j, \alpha)$ (5). The horizontal scale, $j$ or depth, is the same for $\mathcal{E}$, $\nu(j)$ and $\nu(j, \alpha)$. The vertical axis of $\mathcal{E}$ is the image index, and of $\nu(j, \alpha)$ is a coarse estimate of the orientation $\alpha$ at $P_j$. The vertical axis of $\nu(j)$ has no significance; it is a single row that has been replicated to increase visibility. To the right, $\nu(j)$ and $\nu(j, \alpha)$ are also shown as two-dimensional plots^6.


Figure 11a shows the epipolar image that results when the upper left-hand corner of the foreground building is chosen as $p^*$. Near the bottom of $\mathcal{E}$, $\ell_i$ is close to horizontal, and $p_j^i$ is the projection of blue sky everywhere except at the building corner. The corner points show up in $\mathcal{E}$ near the right side as a vertical streak. This is as expected since the construction of $\mathcal{E}$ places the projections of $P_j$ in the same column. Near the middle of $\mathcal{E}$, the long side-to-side streaks result because $P_j$ is occluded, and near the top the large black region is produced because $p_j^i \notin \Pi_i$. Both $\nu(j)$ and $\nu(j, \alpha)$ have a sharp peak^7 that corresponds to the vertical stack of corner points. This peak occurs at a depth of 2375 units ($j = 321$) for $\nu(j)$ and a depth of 2385 units ($j = 322$) for $\nu(j, \alpha)$. The actual distance to the corner is 2387.4 units. The reconstructed world coordinates of $p^*$ are $[-1441, -3084, 1830]^T$ and $[-1438, -3077, 1837]^T$ respectively. The actual coordinates^8 are $[-1446, -3078, 1846]^T$.

^6 Actually, $\sum_\alpha \nu(j, \alpha) / \sum_\alpha 1$ is plotted for $\nu(j, \alpha)$.
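The size of these errors can be checked with a few lines of arithmetic. The sketch below only restates the numbers reported above; the variable and function names are illustrative.

```python
from math import sqrt

ACTUAL_DEPTH = 2387.4
ACTUAL_POINT = (-1446.0, -3078.0, 1846.0)
ESTIMATES = {
    "nu(j)": (2375.0, (-1441.0, -3084.0, 1830.0)),
    "nu(j, alpha)": (2385.0, (-1438.0, -3077.0, 1837.0)),
}


def depth_error(depth):
    """Absolute difference between a recovered depth and the actual distance."""
    return abs(depth - ACTUAL_DEPTH)


def position_error(point):
    """Euclidean distance between a reconstructed point and the actual corner."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(point, ACTUAL_POINT)))


for name, (depth, point) in ESTIMATES.items():
    err = depth_error(depth)
    print(f"{name}: depth error {err:.1f} units ({100 * err / ACTUAL_DEPTH:.2f}%), "
          f"position error {position_error(point):.1f} units")
```

The depth errors come to about 12.4 and 2.4 units, roughly 0.5% and 0.1% of the actual distance, and the reconstructed points lie about 17.8 and 12.1 units from the actual corner.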

Figure 11b shows the epipolar image that results when a point just on the dark side of the front left edge of the building is chosen as $p^*$. Again both $\nu(j)$ and $\nu(j, \alpha)$ have a single peak that agrees well with the depth obtained using manual correspondence. This time, however, the peaks are asymmetric and have much broader tails. This is caused by the high contrast between the bright and dark faces of the building and the lack of contrast within the dark face. The peak in $\nu(j, \alpha)$ is slightly better than the one in $\nu(j)$.

Figure 11c shows the epipolar image that results when a point just on the bright side of the front left edge of the building is chosen as $p^*$. This time $\nu(j)$ and $\nu(j, \alpha)$ are substantially different. $\nu(j)$ no longer has a single peak. The largest peak occurs at $j = 370$ and the next largest at $j = 297$. The manual measurement agrees with the peak at $j = 297$. The peak at $j = 370$ corresponds to the point where $\ell^*$ exits the back side of the building. $\nu(j, \alpha)$, on the other hand, still has a single peak, clearly indicating the usefulness of estimating $\alpha$.

^7 White indicates minimum error, black maximum.
^8 Some of the difference may be due to the fact that $p^*$ was chosen by hand and might not be the exact projection of the corner.

Figure 11: $\Pi^*$, $p^*$, $\mathcal{E}$, $\nu(j)$ and $\nu(j, \alpha)$.

Figure 12: $\Pi^*$, $p^*$, $\mathcal{E}$, $\nu(j)$ and $\nu(j, \alpha)$.

In Figure 12a, $p^*$ is a point from the interior of a building face. There is a clear peak in $\nu(j, \alpha)$ that agrees well with manual measurements and is better than that in $\nu(j)$. In Figure 12b, $p^*$ is a point on a building face that is occluded (Figure 8) in a number of views. Both $\nu(j)$ and $\nu(j, \alpha)$ produce fairly good peaks that agree with manual measurements. In Figure 12c, $p^*$ is a point on a building face with very low contrast. In this case, neither $\nu(j)$ nor $\nu(j, \alpha)$ provides clear evidence about the correct depth. The actual depth occurs at $j = 386$. Both $\nu(j)$ and $\nu(j, \alpha)$ lack sharp peaks in large regions with little or no contrast or excessive occlusion. Choosing $p^*$ as a sky or ground pixel will produce a nearly constant $\nu(j)$ or $\nu(j, \alpha)$.

To further test our method, we reconstructed the depth of a region in one of the images (Figure 13). For each pixel inside the black rectangle, the global maximum of $\nu(j, \alpha)$ was taken as the depth of that pixel. Figure 14a shows the depth for each of the 3000 reconstructed pixels plotted against the $x$ image coordinate of the pixel. Slight thickening is caused by the fact that depth changes slightly with the $y$ image coordinate. The clusters of points at the left end (near a depth of 7000) and at the right end correspond to sky points. The actual depth for each pixel was calculated from the CAD model. Figure 14b shows the actual depths (in grey) overlaid on top of the reconstructed values. Figure 15 shows the same data plotted in world coordinates^9. The actual building faces are drawn in grey, and the camera position is marked by a grey line extending from the center of projection in the direction of the optic axis. The reconstruction shown (Figures 14 and 15) was performed purely locally at each pixel. Global constraints such as ordering or smoothness were not imposed, and no attempt was made to remove low confidence depths or otherwise post-process the global maximum of $\nu(j, \alpha)$.


4 Conclusions 

This paper describes a method for generating dense depth maps directly from large sets of images taken from arbitrary positions. The algorithm presented uses only local calculations and is simple and accurate. Our method builds, then analyzes, an epipolar image to accumulate evidence about the depth at each image pixel. This analysis produces an evidence versus depth and surface normal distribution that in many cases contains a clear and distinct global maximum. The location of this peak determines the depth, and its shape can be used to estimate the error. The distribution can also be used to perform a maximum likelihood fit of models to the depth map. We anticipate that the ability to perform maximum likelihood estimation from purely local calculations will prove extremely useful in constructing three-dimensional models from large sets of images.
^9 All coordinates have been divided by 1000 to simplify the plots.

Figure 14: Reconstructed and actual depth
[Panels a and b. Axes: depth (world coordinates) versus x (image coordinates).]

Figure 15: Reconstructed and actual world